Classification of a Massive Number of Viral Genomes and Estimation of Time of Most Recent Common Ancestor (tMRCA) of SARS-CoV-2 Using Phylodynamic Analysis

Estimating the time of the most recent common ancestor (tMRCA) is important for tracing the origin of pathogenic viruses. The estimate is based on the genetic diversity accumulated over a given time period. Thousands of mutation sites have arisen in SARS-CoV-2 genomes since the COVID-19 pandemic started; six highly linked mutation sites occurred early, before the start of the pandemic, and can be used to classify the genomes into three main haplotypes. Tracing the origin of those three haplotypes may help to understand the origin of SARS-CoV-2. In this article, we present a complete protocol for classifying SARS-CoV-2 genomes and calculating tMRCA using a Bayesian phylodynamic method. This protocol may also be used in the analysis of other viral genomes.

Key features
• Filtering and alignment of a massive number of viral genomes using custom scripts and ViralMSA.
• Classification of genomes based on highly linked sites using custom scripts.
• Phylodynamic analysis of viral genomes using Bayesian evolutionary analysis sampling trees (BEAST).
• Visualization of the posterior distribution of tMRCA using Tracer v1.7.2.
• Optimized for SARS-CoV-2.


Background
Revealing the origins of pathogenic viruses, which is crucial for cutting them off at the root and preventing future spillover, requires long-term hard work from scientists all around the world [1]. Although some infectious pathogens can be traced back decades, the debate on their origin continues. For example, AIDS was officially reported on June 5, 1981, by the Centers for Disease Control and Prevention of the USA. Five years later, HIV infection was detected in a human serum sample collected in Léopoldville in early 1959 [2]. Bayesian phylodynamic analyses using viral gene sequences recovered from decades-old paraffin-embedded tissues traced the most recent common ancestor (MRCA) of the M group of HIV back to approximately 1908 (CI 1884-1924), suggesting that HIV has been circulating in the human population for approximately 100 years [3]. MERS-CoV is another example: it was first reported in a Saudi Arabian man in 2012 [4]. Bats are thought to be the reservoir hosts of MERS-CoV, and dromedary camels are considered to be the major intermediate host [5]; however, the transmission route from animals to humans is not well understood. Researchers tested 189 camel serum samples from 1983 to 1997 and found that 81% had neutralizing antibodies against MERS-CoV, suggesting long-term virus circulation in these animals [6]. Similarly, COVID-19 was first reported on December 27, 2019, in Wuhan, China [7,8], and the Huanan seafood market was suspected to be the place of origin [9]; however, disputes remain. Pekar and colleagues explored the evolutionary dynamics of the first wave of SARS-CoV-2 infections in China using a strict-clock Bayesian phylodynamic analysis but failed to capture the index case [10], probably because redundant sequences were not removed, which usually affects the accuracy of time of MRCA (tMRCA) estimation, as indicated in two recent tMRCA analyses [11,12]. Genome classification plays a critical role in tracing the origin of pathogenic viruses [3,12]. We have previously classified SARS-CoV-2 genomes based on two amino acids, Spike-614 and Orf8-84, and revealed 16 haplotypes. Of those, three major haplotypes were found to separately drive the development of the pandemic in China and the world. However, genome classification based on amino acid mutations did not rule out recombination and reverse mutations. In this paper, we provide detailed protocols to filter and classify the
$ echo 'EPI_ISL_402124' > accession.txt # EPI_ISL_402124 is the reference sequence used by the GISAID database and in many studies.

E. Retrieve genomes with complete, high coverage sequences and accurate dates and sampled from human hosts
The accession number, host, completeness, and coverage of the genomes are located in columns 3, 8, 18, and 19, respectively, in the metadata of April 29, 2022. The sample collection date is located in column 4. The column numbers may differ in metadata downloaded on a different day.
1. Filter metadata.tsv for accessions with complete genomes, with high coverage, and from human hosts.
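
The filter in step 1 could be sketched in Python as follows. This is a minimal illustration, not the protocol's custom script. It uses the column numbers quoted above for the April 29, 2022 metadata. The cell values "Human" and "True", and the rule that an accurate date is a full YYYY-MM-DD string, are assumptions to verify against your own file. The accessions in the toy table are hypothetical.

```python
import csv
import io
import re

# 1-based column numbers quoted in the text for the April 29, 2022 metadata.
# They may differ in other downloads, so they are passed as a parameter.
COLS = {"accession": 3, "date": 4, "host": 8, "complete": 18, "high_coverage": 19}

# Assumption: an "accurate" collection date is a full YYYY-MM-DD string.
FULL_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def filter_metadata(handle, cols=COLS, host="Human", flag="True"):
    """Yield (accession, date) for rows that pass all filters.

    `host` and `flag` are assumed cell values, not confirmed by the protocol.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header row
    width = max(cols.values())
    for row in reader:
        if len(row) < width:
            continue  # short or malformed line
        cell = {name: row[idx - 1].strip() for name, idx in cols.items()}
        if (cell["host"] == host
                and cell["complete"] == flag
                and cell["high_coverage"] == flag
                and FULL_DATE.match(cell["date"])):
            yield cell["accession"], cell["date"]


def _toy_row(acc, date, host, complete, coverage):
    """Build a 19-column tab-separated row (hypothetical data)."""
    row = [""] * 19
    row[2], row[3], row[7], row[17], row[18] = acc, date, host, complete, coverage
    return "\t".join(row)


toy = "\n".join([
    "\t".join(f"col{i}" for i in range(1, 20)),                    # header
    _toy_row("EPI_ISL_A", "2020-01-05", "Human", "True", "True"),  # kept
    _toy_row("EPI_ISL_B", "2020-01", "Human", "True", "True"),     # incomplete date
    _toy_row("EPI_ISL_C", "2020-02-01", "Bat", "True", "True"),    # non-human host
])
kept = list(filter_metadata(io.StringIO(toy)))
print(kept)
```

For a real file, pass an open handle instead of the in-memory table, and adjust `COLS` to the column layout of that download.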

G. All genome classification
1. Retrieve the six highly linked sites using the custom script fetch_nucleotides_from_alignments.pl (Figure 2).
2. Classify the genomes into haplotypes by the six linked mutation sites (Figure 3).
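
The idea behind these two steps can be illustrated with a short Python sketch. It is not the code of fetch_nucleotides_from_alignments.pl. It assumes that alignment columns coincide with reference coordinates, so that a site number can be used directly as a 1-based index. The only haplotype pattern quoted in this protocol text is CCTCAC for DS. The toy positions and toy sequence below are hypothetical, and the real six site positions must be supplied by the reader.

```python
def fetch_sites(aligned_seq, positions):
    """Return the nucleotides at 1-based alignment positions as one uppercase string."""
    return "".join(aligned_seq[p - 1] for p in positions).upper()


# Only the DS pattern appears in this text; add the other haplotypes as needed.
HAPLOTYPES = {"CCTCAC": "DS"}


def classify(aligned_seq, positions, table=HAPLOTYPES):
    """Map the linked-site pattern of one genome to a haplotype name."""
    return table.get(fetch_sites(aligned_seq, positions), "unclassified")


# Hypothetical toy example: six made-up positions in a 20-nt sequence.
toy_positions = (2, 5, 8, 11, 14, 17)
toy = ["a"] * 20
for pos, nt in zip(toy_positions, "CCTCAC"):
    toy[pos - 1] = nt
toy_ds = "".join(toy)

print(fetch_sites(toy_ds, toy_positions))  # CCTCAC
print(classify(toy_ds, toy_positions))     # DS
```

Keeping the pattern-to-name table separate from the lookup makes it easy to add further haplotypes without touching the classification logic.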

H. Classification of early genomes collected in the early phase of the pandemic (from beginning to end of April 2020)
The sample collection date is located in column 4 in the metadata of April 29, 2022.
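
Selecting the early genomes amounts to a date comparison on the collection-date column. The sketch below assumes dates are written as YYYY-MM-DD and treats April 30, 2020 as the cutoff, following the heading above. Excluding incomplete dates is an assumption of this example, not a rule stated in the protocol.

```python
from datetime import date

CUTOFF = date(2020, 4, 30)  # end of April 2020, per the section heading


def is_early(collection_date, cutoff=CUTOFF):
    """True if a YYYY-MM-DD collection date falls on or before the cutoff."""
    try:
        return date.fromisoformat(collection_date) <= cutoff
    except ValueError:
        return False  # incomplete or malformed dates are excluded (assumption)


print(is_early("2020-04-30"))  # True
print(is_early("2020-05-01"))  # False
```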

I. Bayesian phylodynamic analysis using the early genomes of three haplotypes as examples
1. Filter out genomes with more than 0.05% unknown nucleotides.
a. Filter out genomes of the DS (CCTCAC) haplotype with more than 0.05% unknown nucleotides.
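
The 0.05% threshold can be checked per genome with a simple count. The sketch below assumes that unknown nucleotides are encoded as N and that the denominator is the full sequence length; both are assumptions of this example. For a genome of about 29.9 kb, 0.05% corresponds to roughly 15 positions.

```python
def unknown_fraction(seq, unknown="N"):
    """Fraction of characters in `seq` that are listed in `unknown` (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 1.0  # treat an empty record as fully unknown
    return sum(seq.count(ch) for ch in unknown) / len(seq)


def passes(seq, max_fraction=0.0005):
    """Keep a genome unless its unknown fraction exceeds 0.05% (0.0005)."""
    return unknown_fraction(seq) <= max_fraction


clean = "ACGT" * 2500                   # 10,000 nt, no unknown bases
noisy = "N" * 6 + "ACGT" * 2500         # 6 unknown bases in 10,006 nt
print(passes(clean), passes(noisy))
```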

Figure 2. Output format of the six linked sites in the genomes. The six nucleotides of sites 241, 3037,
j. Visualize tMRCA with Tracer (Figure 4).
# Open Tracer v1.7.2 by double-clicking the icon.
# Import the log file created by BEAST.
The posterior distribution of tMRCA can be shown by clicking 'age(root)' and 'Marginal Density'. The mean tMRCA and 95% HPD interval are provided in 'Estimates'.
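
For readers who want to cross-check the values shown in 'Estimates', the mean and a 95% HPD interval can also be computed from the log file. The sketch below is not Tracer's code. It assumes a tab-separated log with '#' comment lines and a column named 'age(root)', discards an assumed 10% burn-in, and takes the HPD as the shortest interval containing 95% of the remaining samples.

```python
import math


def read_column(lines, column="age(root)", burnin=0.1):
    """Read one numeric column from a tab-separated log, dropping the burn-in fraction."""
    rows = [ln.rstrip("\n").split("\t")
            for ln in lines
            if ln.strip() and not ln.startswith("#")]
    header, data = rows[0], rows[1:]
    idx = header.index(column)
    values = [float(r[idx]) for r in data]
    return values[int(len(values) * burnin):]


def hpd_interval(samples, mass=0.95):
    """Shortest interval that contains at least `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(mass * n))  # number of samples inside the interval
    # Slide a window of k consecutive sorted samples and keep the narrowest one.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


def summarize(lines, column="age(root)", burnin=0.1, mass=0.95):
    """Return (mean, hpd_low, hpd_high) for one log column."""
    values = read_column(lines, column, burnin)
    low, high = hpd_interval(values, mass)
    return sum(values) / len(values), low, high
```

With a real run, `summarize(open(path))` would be called on the BEAST log, where `path` is the log file name chosen in the BEAST setup.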

Figure 4. Screenshot of time of most recent common ancestor (tMRCA) estimation of three main haplotypes of the early SARS-CoV-2 genomes as summarized with Tracer. The dates are shown as decimal years.
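
Decimal dates such as those in Figure 4 can be converted to calendar dates for reporting. The sketch below assumes the common convention that the fractional part is the share of the year already elapsed, counted in days. Other tools may use slightly different conventions, so results can differ by a day.

```python
from datetime import date, timedelta


def decimal_to_date(decimal_year):
    """Convert a decimal year (e.g. 2019.5) to a calendar date."""
    year = int(decimal_year)
    start = date(year, 1, 1)
    days_in_year = (date(year + 1, 1, 1) - start).days  # 365 or 366
    elapsed = int((decimal_year - year) * days_in_year)  # whole days since January 1
    return start + timedelta(days=elapsed)


print(decimal_to_date(2019.5))  # 2019-07-02
```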

1. Create a Python environment with conda.

B. Download the viral genomes and metadata
1. Download SARS-CoV-2 genomes and meta.tsv files from the GISAID database [16] after login (https://gisaid.org/). Note that the content in each column in the meta.tsv may change (e.g., the accession numbers were in column 3 in the metadata downloaded on May 1, 2022, but in column 5 in the metadata downloaded on May 1, 2023).
2. Unpack the files.
C.